Introduction: Why Code for Data Science?
Phil Chodrow
Tuesday, August 27th, 2019
Some Things That Aren’t Data Science
The Cloud™

BIG DATA!!1!!

Data Science Is:
- Gathering data that matters.
- Asking questions that matter about your data.
- Choosing appropriate methods to answer those questions.
- Implementing solutions that meet stakeholder needs.
You Can Do Data Science With:
- A pencil and paper.
- A calculator.
- Excel.
- Coding: R, Julia, Python…
Why Not Excel?
- Flexibility:
  - Limited statistics/ML
  - Poor visualizations
- Reproducibility:
  - Platform limited
  - Can’t version control
  - Difficult to inspect code/troubleshoot
- Scalability:
  - An Excel worksheet holds at most 1,048,576 (~1M) rows
Why Code?
- Flexibility:
  - “There’s a package for that”
  - Custom analysis and visualization
- Reproducibility:
  - Cross-platform, often FOSS
  - Version control with git
  - Easy to inspect code
- Scalability:
  - R and Python: ~10M rows easily on a laptop.
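To make the reproducibility and inspectability points concrete, here is a minimal sketch in Python using only the standard library. The data set, the function names `make_sales` and `mean_by_region`, and the seed value are all hypothetical, chosen for illustration only; the same idea carries over to R or Julia.

```python
import random
import statistics
from collections import defaultdict


def make_sales(n_rows, seed=2019):
    """Generate hypothetical (region, amount) records.

    The fixed seed means every run produces exactly the same data,
    which is what makes the analysis repeatable.
    """
    rng = random.Random(seed)
    regions = ["north", "south", "east", "west"]
    return [
        (rng.choice(regions), round(rng.uniform(10, 500), 2))
        for _ in range(n_rows)
    ]


def mean_by_region(rows):
    """Group amounts by region and return the mean amount per region."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return {region: statistics.mean(amounts) for region, amounts in groups.items()}


if __name__ == "__main__":
    rows = make_sales(100_000)
    for region, avg in sorted(mean_by_region(rows).items()):
        print(f"{region:>5}: {avg:8.2f}")
```

Every step of the analysis is plain text: it can be read line by line, rerun by anyone, and tracked in version control.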
Version Control with git
- Break your workflow into manageable stages; easily collaborate; access cool code.
- Promote your brand: share your work, build a portfolio, host your website.
- Used at: Google, Facebook, Netflix, Amazon, Apple, Twitter, Microsoft… (source)
Data Analysis with R
R is the best language in the world for learning data science.
R is one of the best languages in the world for doing data science.
R tends to be preferred in academia and among “statisticians,” while Python is more popular among “computer scientists” and “data scientists.”
- Most practicing data scientists know and use both.
Optimization with Julia and JuMP
Julia is a high-performance, open-source dynamic language for technical computing – easy writing, fast compute times.
- Developed at MIT.
JuMP is a package for optimization in Julia – developed by ORC students!
- Not everyone uses Julia…yet.
…yes, there will be an opportunity to learn Python later in the semester.
What can you pick up in two days?
- You are not going to become an expert in two days.
- But…
- You will know the basic concepts and vocabulary of data science – enough to employ the most important skill of all.
The most important skill of all…

Gameplan
- Today: Version Control, Basic Data Analysis and Visualization in R, RMarkdown.
- Tomorrow: Optimization in Julia and JuMP, selected presentations.
- Both days: mini-project, partner work, lots of exercises.
Exercise 0
- Look left.
- Look right.
- Pick a partner (groups of 3 are fine).
- Give them a professional, yet friendly smile.
- You are going to need them soon.